General base state assignment for optimal massive parallelism

ABSTRACT

General base hypercube transformations using general base perfect shuffles and Kronecker matrix products are applied to the problem of parallel to massively parallel processing of sparse matrices. Hypercube transformations lead to optimal scheduling with contention-free memory allocation at any level of parallelism, up to massive parallelism. The approach is illustrated by applying the generalized-parallelism hypercube transformations to general base factorizations of generalized spectral analysis transformation matrices, and in particular to Generalized Walsh-Chrestenson transformation matrices, of which the Discrete Fourier transform and hence the Fast Fourier transform are but a special case. These factorizations are a function of four variables, namely, the general base p, the number of members of the class of matrices n, a parameter k describing the matrix structure and the number M of parallel processing elements. The degree of parallelism, in the form of M=p^(m) processors, can be chosen arbitrarily by varying m between zero and its maximum value of n−1. The result is an equation describing the solution as a function of the four variables n, p, k and m.

FIELD OF THE INVENTION

A formalism and an algorithm for the general base parallel dispatch and sequencing state assignment of optimal general-base massively parallel multiprocessing architecture are presented. Transformations of a base-p hypercube, where p is an arbitrary integer, are shown to effect a dynamic contention-free general base optimal memory allocation of the parallel to massively parallel multiprocessing architecture. The formalism is shown to provide a single unique description of the architecture and sequencing of parallel operations. The approach is illustrated by factorizations involving the processing of matrices, which are functions of four variables. Parallel operations are implemented as matrix multiplications. Each matrix, of dimension N×N, where N=p^(n), n integer, is a sampling matrix of which the structure depends on a variable parameter k. The degree of parallelism, in the form of M=p^(m) processors, can be chosen arbitrarily by varying m between zero and its maximum value of n−1. The result is an equation describing the solution as a function of the four variables n, p, k and m.

Applications of the approach are shown in relation with complex matrix structures of image processing and generalized spectral analysis transforms but cover a much larger class of parallel processing and multiprocessing systems.

BACKGROUND OF THE INVENTION

Most computer arithmetic operations encountered in information processing algorithms in general and signal processing and sorting algorithms in particular call for iterative multiplications of large matrices. An approach and a formalism for designing optimal parallel/pipelined algorithms and processor architectures for effecting such operations has been recently proposed in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459. The algorithms are optimal in their minimization of addressing requirements, of shuffle operations and of the number of memory partitions they call for. The algorithms and corresponding architectures involve general base matrix factorizations. As an application, the factorizations and corresponding optimal architectures are developed in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, to obtain optimal parallel-pipelined processors for the Generalized Walsh-Chrestenson transform, of which the Discrete (fast) Fourier transform is but a special case.

SUMMARY OF THE INVENTION

This invention describes a technique for designing optimal multiprocessing parallel architectures which employ multiples of general-base processors operating in parallel in an optimal global architecture. A formalism and closed forms are developed defining the state and sequencing assignments in a programmable hierarchical level of parallelism at each step of the algorithm execution.

A class of hierarchically parallel multiprocessing architectures is presented, employing general-base universal processing elements previously introduced as basic tools for multiprocessing, as in “3-D cellular arrays for parallel/cascade image/signal processing”, Michael J. Corinthios, in Spectral Techniques and Fault Detection, M. Karpovsky, Ed. New York: Academic Press, 1985, and “The Design of a class of Fast Fourier Transform Computers”, Michael J. Corinthios, IEEE Trans. Comput., Vol. C-20, pp. 617-623, June 1971. Applications of the perfect shuffle matrices and hypercube representations to other classes of problems such as sorting and interconnection networks have received attention over the course of many years in “3-D cellular arrays for parallel/cascade image/signal processing”, Michael J. Corinthios, in Spectral Techniques and Fault Detection, M. Karpovsky, Ed. New York: Academic Press, 1985, “The Design of a class of Fast Fourier Transform Computers”, Michael J. Corinthios, IEEE Trans. Comput., Vol. C-20, pp. 617-623, June 1971, “A Parallel Algorithm for State Assignment of Finite State Machines”, G. Hasteer and P. Banerjee, IEEE Trans. Comput., Vol. 47, No. 2, pp. 242-246, February 1998, “Hypercube Algorithms and Implementations”, O. A. Mc Bryan and E. F. Van De Velde, SIAM J. Sci. Stat. Comput., Vol. 8, No. 2, pp. s227-287, Mar. 1987, “Parallel Processing with the Perfect Shuffle”, H. S. Stone, IEEE Trans. Comput., Vol. C-20, No. 2, pp. 153-161, February 1971, and “Design of a Massively Parallel Processor”, K. E. Batcher, IEEE Trans. Comput., pp. 836-840, September 1980. Advances in state assignment and memory allocation for array processors, using processing elements as multiprocessing cells, and their interconnection networks have been made in the last two decades in “Parallel Processing with the Perfect Shuffle”, H. S. Stone, IEEE Trans. Comput., Vol. C-20, No. 2, pp. 153-161, February 1971, and “Hierarchical Fat Hypercube Architecture for Parallel Processing Systems”, Galles, Michael B., U.S. Pat. No. 5,669,008, September 1997. Many of these contributions applied parallel and multiprocessing architectures to signal processing applications and in particular spectral analysis algorithms. In more recent years applications of parallel and multiprocessing techniques have focused on generalized spectral analysis, Discrete Cosine, Haar, Walsh and Chrestenson Transforms, among others, in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, “3-D cellular arrays for parallel/cascade image/signal processing”, Michael J. Corinthios, in Spectral Techniques and Fault Detection, M. Karpovsky, Ed. New York: Academic Press, 1985, “Parallel Processing with the Perfect Shuffle”, H. S. Stone, IEEE Trans. Comput., Vol. C-20, No. 2, pp. 153-161, February 1971, “Access and Alignment of Data in an Array Processor”, D. H. Lawrie, IEEE Trans. Comput., Vol. C-24, No. 2, December 1975, pp. 1145-1155, “Fast Fourier Transforms over Finite Groups by Multiprocessor Systems”, Roziner, T. D., Karpovsky, M. G., and Trachtenberg, L. A., IEEE Trans. Acoust., Speech, and Sign. Proc., ASSP, Vol. 38, No. 2, February 1990, pp. 226-240, “An Architecture for a Video Rate Two-Dimensional Fast Fourier Transform processor”, Taylor, G. F., Steinvorth, R. H., and MacDonald J., IEEE Trans. Comput., Vol. 37, No. 9, September 1988, pp. 1145-1151, “Fault tolerant FFT Networks”, Jou, Y.-Y. and Abraham, J. A., IEEE Trans. Comput., Vol. 37, No. 5, May 1988, pp. 548-561, “Design of Multiple-Valued Systolic System for the Computation of the Chrestenson Spectrum”, Moraga, Claudio, IEEE Trans. Comput., Vol. C-35, No. 2, February 1986, pp. 183-188, “Matrix Representation for Sorting and the Fast Fourier Transform”, Sloate, H., IEEE Trans. Circ. and Syst., Vol. CAS-21, No. 1, January 1974, pp. 109-116, and “Processor for Signal processing and Hierarchical Multiprocessing Structure Including At Least One Such Processor”, Luc Mary and Barazesh, Bahman, U.S. Pat. No. 4,845,660, July 1989. In “3-D cellular arrays for parallel/cascade image/signal processing”, Michael J. Corinthios, in Spectral Techniques and Fault Detection, M. Karpovsky, Ed. New York: Academic Press, 1985, three-dimensional parallel and pipelined architectures of cellular array multiprocessors employ Configurable Universal Processing Elements (CUPE) forming what were referred to as ‘Isostolic Arrays’, applied to signals as well as images in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, and “3-D cellular arrays for parallel/cascade image/signal processing”, Michael J. Corinthios, in Spectral Techniques and Fault Detection, M. Karpovsky, Ed. New York: Academic Press, 1985.

Many patents of invention deal with the subject of hypercube transformations, such as described in U.S. Pat. Nos. 5,669,008, 5,644,517, 5,513,371, 5,689,722, 5,475,856, 5,471,412, 4,980,822,916,657 and 4,845,660. The present invention is unique in its concept of a generalized level of massive parallelism. The formulation is presented for an arbitrary number of M processing elements, M=p^(m), p being the general radix of factorization. The input data vector dimension N, or input data matrix dimension N×N, where N=p^(n), the radix of factorization of the matrix p, the number of processors M, and the span of the matrix, as defined in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, are all variable. A unique optimal solution is presented, yielding parallel to massively parallel optimal architectures, with optimality as defined in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459.

The approach, which was recently submitted for publication, and submitted as a Disclosure Document, is illustrated by developing a formalism and optimal factorizations for the class of algorithms of generalized spectral analysis introduced recently in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459. It has been shown in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, that transforms such as Fourier and more generally Chrestenson Generalized-Walsh (CGW) transforms can be factored into optimal forms.

Basic Definitions

In what follows we use some matrix definitions, such as the definition of a sampling matrix, matrix poles and zeros, a matrix span, fixed topology processor and shuffle-free processor, introduced in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459. In addition we adopt the following definitions, which will be formally introduced in later sections.

General Base Processing Element

In what follows a general-base processing element PE with a base, or radix, p is a processor that receives simultaneously p input operands and produces simultaneously p output operands. The PE in general applies arithmetic or weighting operations on the input vector to produce the output vector. In matrix multiplication operations, for example, the PE applies a p×p matrix to the p-element input vector to produce the p-element output vector. The matrix elements may be real or complex.

Due to the diversified general applicability of such a processing element, a Universal Processing Element UPE, which can be constructed in a 3D-type architecture, has been recently proposed in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, and “3-D cellular arrays for parallel/cascade image/signal processing”, Michael J. Corinthios, in Spectral Techniques and Fault Detection, M. Karpovsky, Ed. New York: Academic Press, 1985. Its 3D-type architecture is such that its intermediate computation results are propagated between planes rather than in 2D along a plane. It may be viewed as a base-p processing element, operating on the p elements of an input vector simultaneously, applying to it a general p×p matrix and producing p output operands as the p-element output vector. A UPE has p×p=p² multipliers but may instead be realized in a 3D architecture, in particular if the matrix is a transformation matrix that can itself be factored as in “3-D cellular arrays for parallel/cascade image/signal processing”, Michael J. Corinthios, in Spectral Techniques and Fault Detection, M. Karpovsky, Ed. New York: Academic Press, 1985, and “The Design of a class of Fast Fourier Transform Computers”, Michael J. Corinthios, IEEE Trans. Comput., Vol. C-20, pp. 617-623, June 1971. The pipelining approach described in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, can thus be used, leading to the 3D-type “isostolic” architecture described in the same reference.

In the context of this invention a UPE may be seen simply as a general base-p processing element PE as defined above, accepting p inputs, weighting them by the appropriate p×p matrix and producing p output operands.

Pilot Elements, Pilots Matrix

Similarly to signals and images, an N×N matrix may be sampled and the result is “impulses”, that is, isolated elements in the resulting N×N samples (sampling) matrix. We shall assume uniform sampling of rows and columns yielding p uniformly spaced samples from each of p rows and element alignment along columns, that is, p uniformly spaced samples along columns as well as rows. The samples matrix, which we may refer to as a “frame”, thus contains p rows of p equally spaced elements each, a rectangular grid of p² impulses (poles) which we shall call a “dispatch”. With N=p^(n) the N² elements of the “main” (or “parent”) matrix may thus be decomposed into N²/p²=p^(2n−2) such dispatches.

By fixing the row sampling period, the row span, as well as the column sampling period, the column span, it suffices to know the coordinates (indices) of the top left element, that is, the element with the smallest of indices, of a dispatch to directly deduce the positions of all its other poles (elements). The top left element thus acts as a reference point, and we shall call it the “pilot element”. The other p²−1 elements associated with it may be called its “satellites”.

In other words, if the element a_(ij) of A is a pilot element, the dispatch consists of the elements

a_(i+kc,j+lr); k=0,1, . . . ,p−1, l=0,1, . . . ,p−1

c and r being the column and row element spacings (spans), respectively.

A processing element assigned to a pilot element can thus access all p² operands of the dispatch, having deduced their positions knowing the given row and column spans.
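The address computation just described can be sketched in a few lines of Python. This is only an illustration: the function name `dispatch_positions` and its argument order are assumptions made here, not part of the formalism. It lists the p² index pairs (i+kc, j+lr) that a processing element would visit from a pilot at (i, j).

```python
def dispatch_positions(i, j, p, c, r):
    """Return the p*p index pairs (i + k*c, j + l*r), k, l = 0..p-1.

    (i, j) is the pilot element; c and r are the two element spacings
    (spans) that appear in the expression a_(i+kc, j+lr) of the text.
    """
    return [(i + k * c, j + l * r) for k in range(p) for l in range(p)]


if __name__ == "__main__":
    # Base p = 2, pilot at (0, 1), both spans equal to 4.
    for position in dispatch_positions(0, 1, 2, 4, 4):
        print(position)
```

The first pair returned is always the pilot itself (k=l=0); the remaining p²−1 pairs are its satellites.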

Since each pilot element of a frame originated from the same position in the parent matrix, we can construct a “pilots matrix” by keeping only the pilot elements and forcing to zero all other elements of the parent matrix. The problem then is one of assignment, simultaneous and/or sequential, of the M=p^(m) processors to the different elements of the pilots matrix.

Hypercube Dimension Reduction

The extraction of a pilots matrix from its parent matrix leads to a dimension reduction of the hypercube representing its elements. The dimension reduction is in the form of a suppression, that is, a forcing to zero, of one of the hypercube digits. Let C=(j_(n−1) . . . j₁j₀)_(p) be an n-digit base-p hypercube. We will write C_(\overline{k}) to designate the hypercube C with the digit k suppressed, that is, forced to zero. Several digits can be similarly suppressed. For example, C_(\overline{2},\overline{4})=(j_(n−1) . . . j₅0j₃0j₁j₀)_(p), and C_(\overline{n−1})=(0j_(n−2) . . . j₁j₀)_(p). It is interesting to note that the hypercube dimension reduction implies a “skipping” over its zeros in permutation operations such as those involving the perfect shuffle. For example, if A=C_(\overline{2}) then PA=(j₀j_(n−1) . . . j₅j₄0j₃j₁)_(p), where the suppressed digit stays in position 2 and the remaining n−1 digits rotate around it.
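A small Python sketch can make the “skipping” behaviour concrete. It rests on one reading of the text, stated here as an assumption: the perfect shuffle acts on a hypercube as a one-place cyclic rotation of its digits (least significant digit moved to the front), and suppressed positions are left in place while the remaining digits rotate around them. The helper names `suppress` and `shuffle_skipping` are hypothetical.

```python
def suppress(digits, positions):
    """Force the digits at the given positions to zero.

    digits are listed most significant first; position 0 is the
    rightmost (least significant) digit.
    """
    n = len(digits)
    out = list(digits)
    for pos in positions:
        out[n - 1 - pos] = 0
    return out


def shuffle_skipping(digits, suppressed):
    """Rotate the non-suppressed digits one place to the right,
    leaving every suppressed position untouched (assumed reading)."""
    n = len(digits)
    # List indices (most significant first) that take part in the rotation.
    live = [idx for idx in range(n) if (n - 1 - idx) not in suppressed]
    values = [digits[idx] for idx in live]
    rotated = values[-1:] + values[:-1]
    out = list(digits)
    for idx, value in zip(live, rotated):
        out[idx] = value
    return out


if __name__ == "__main__":
    C = ["j6", "j5", "j4", "j3", "j2", "j1", "j0"]   # n = 7
    A = suppress(C, {2})
    print(A)                         # digit 2 forced to zero
    print(shuffle_skipping(A, {2}))  # rotation that skips the zero
```

With n=7 and digit 2 suppressed, the rotation moves j₀ to the front and shifts every other live digit one place to the right, while the zero stays where it is.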

State Assignment Algorithm

A sequence of perfect shuffle operations effected through simple hypercube transformations can be made to broadcast the state and access assignments to the different processors. The overall approach is described by the following algorithm, which will be developed step by step in what follows.

Algorithm 1: Parallel Dispatch, State Assignment and Sequencing Algorithm

    Read base p
    n = log_(p) N
    m = log_(p) M
    Read Input matrix name A
    For k = 0 to n−1 do
      For r = 0 to n−2 do
        begin
          Assign variables i₀, i₁, . . . , i_(m−1) to M = p^(m) processors
          Evaluate row span σ_(R)
          Evaluate column span σ_(C)
          Test optimality
          Select scan type
          Evaluate pitch
          Dispatch M parallel processors
          Assign variables j_(m), j_(m+1), . . . , j_(n−1) to the access sequencing order of each processor.
          Effect hypercube transformations,
            (j_(n−1) . . . j_(m+1)j_(m)i_(m−1) . . . i₁i₀) → (j_(n−1) . . . j_(m+1)j_(m)i_(m−1) . . . i₁i₀)′
          for k = 0 to p^(n−m−1) do
            begin
              Fork NEXT
                Dispatch processor to Pilot address
                  $\left.\begin{matrix} w(j_{n-1} \ldots j_{m+1} j_{m} i_{m-1} \ldots i_{1} i_{0})^{\prime}, \\ z(j_{n-1} \ldots j_{m+1} j_{m} i_{m-1} \ldots i_{1} i_{0})^{\prime\prime}. \end{matrix}\right\} \quad 0 \leq l \leq m-1$
              NEXT
              for s = 0, 1, . . . , p−1
                w_(R)(s) ← w + s σ_(R)
                z_(C)(s) ← z + s σ_(C)
              end
            end
          Increment j for sequential cycles
        end
    end

The Parallel Dispatch, State Assignment and Sequencing Algorithm 1 dispatches the M=p^(m) processors for each stage of the matrix factorization. The base-p m-tuple (i_(m−1)i_(m−2) . . . i₁i₀)_(p) is assigned to the parallel processors. The (n−m)-tuple (j_(n−1)j_(n−2) . . . j_(m)) is assigned to the sequencing cycles of each processor. The algorithm subsequently applies hypercube transformations as dictated by the type of matrix, the stage of matrix factorization and the number of dispatched processors. It tests optimality to determine the type of scan of matrix elements to be applied and evaluates parameters such as pitch and memory optimal queue length, to be defined subsequently. It then accesses the pilot elements and their satellites, proceeding to the parallel dispatch and sequencing of the processing elements.

Each processing element at each step of the algorithm thus accesses from memory its p input operands and writes into memory its p output operands. The algorithm, while providing an arbitrary hierarchical level of parallelism up to the ultimate massive parallelism, produces an optimal multiprocessing machine architecture minimizing addressing, the number of memory partitions as well as the number of required shuffles. Meanwhile it produces a virtually wired-in pipelined architecture and properly ordered output.

Matrix Decomposition

In developing techniques for the multiprocessing of matrix multiplications it is convenient to effect a decomposition of a matrix into the sum of matrices. To this end let us define an “impulse matrix” as the matrix δ(i,j) of which all the elements are zero except for the element at position (i,j), that is, $\left[\delta(i,j)\right]_{uv} = \begin{cases} 1; & u = i,\ v = j \\ 0; & \text{otherwise} \end{cases} \qquad \text{(4.1)}$

An N×N matrix A having elements [A]_(i,j)=a_(ij) can be written as the sum

A=a_(0,0)δ(0,0)+a_(0,1)δ(0,1)+a_(0,2)δ(0,2)+ . . . +a_(1,0)δ(1,0)+a_(1,1)δ(1,1)+ . . . +a_(N−1,N−1)δ(N−1,N−1)  (4.2)

where the δ(i,j) matrices are of dimension N×N each. The matrix A can thus be written in the form $A = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{i,j}\, \delta(i,j) \qquad \text{(4.3)}$

Furthermore, in the parallel processing of matrix multiplication to a general base p it is convenient to decompose an N×N matrix with N=p^(n) as the sum of dispatches, a dispatch being, as mentioned earlier, a matrix of p² elements arranged in a generally rectangular p×p pattern of p columns and p rows. Denoting by σ_(R) and σ_(C) the row and column spans of a dispatch, as defined in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459, we can decompose a matrix A into the form $A = \sum_{i=0}^{N/p-1} \sum_{j=0}^{N/p-1} \sum_{k=0}^{p-1} \sum_{l=0}^{p-1} a_{i+k\sigma_{C},\, j+l\sigma_{R}}\; \delta\left(i+k\sigma_{C},\, j+l\sigma_{R}\right) \qquad \text{(4.4)}$

More generally we may wish to decompose A in an order different from the uniform row and column scanning as in this last equation. In other words we may wish to pick the dispatches in an arbitrary order rather than in sequence. As mentioned above, we shall call the top left element the pilot element and its p²−1 companions its satellites. In this last equation the pilot elements are those where k=l=0.

To effect a parallel matrix decomposition to a general base p we use hypercubes described by base-p digits. The order of accessing the different dispatches is made in relation to a main clock. The clock K is represented by the hypercube to base p as

K≅(k_(n−1) . . . k₁k₀)_(p); k_(i)∈{0,1, . . . , p−1}  (4.5)

Its value at any time is given by $\begin{matrix}{K = {\sum\limits_{t = 0}^{n - 1}{p^{t}k_{t}}}} & \text{(4.6)}\end{matrix}$

At each clock value K a set of M UPE's (PE's) is assigned a set of M dispatches simultaneously. We will reserve the symbols w and z to designate the row and column indices of a pilot element at clock K. In other words, at clock K each selected pilot element shall be designated a_(w,z), that is, [A]_(w,z), where w and z are functions of K to be defined. They will be determined in a way that optimizes the parallel and sequential operations for the given matrix structure and the number M=p^(m) of available UPE's.

With M=p^(m) base-p processing elements the hypercube representing Kshall be re-written in the form

K≅(j_(n−1) . . . j_(m+1)j_(m)i_(m−1) . . . i₁i₀)_(p)  (4.7)

where we have written $\begin{matrix}{k_{t} = \left\{ \begin{matrix}{i_{t};} & {{t = 0},1,\ldots \quad,{m - 1}} \\{j_{t};} & {{t = m},{m + 1},\ldots \quad,{n - 1}}\end{matrix} \right.} & \text{(4.8)}\end{matrix}$

The m-sub-cube (i_(m−1), . . . i₁i₀) designates operations performed in parallel. The remaining (n−m)-sub-cube (j_(n−1), . . . j_(m+1), j_(m)) designates operations performed sequentially by each of the M dispatched parallel processors. With M=p^(m) processors dispatched in parallel at clock K≅(j_(n−1) . . . j_(m+1)j_(m)i_(m−1) . . . i₁i₀)_(p) the matrix A can be decomposed in the form $A = \sum_{k_{n-2}=0}^{p-1} \cdots \sum_{k_{m+1}=0}^{p-1} \sum_{k_{m}=0}^{p-1} \left\langle \sum_{k_{m-1}=0}^{p-1} \cdots \sum_{k_{1}=0}^{p-1} \sum_{k_{0}=0}^{p-1} \sum_{l=0}^{p-1} \sum_{k=0}^{p-1} a_{w(k_{0},k_{1},\ldots,k_{n-2})+k\sigma_{C},\; z(k_{0},k_{1},\ldots,k_{n-2})+l\sigma_{R}}\; \delta\left[ w(k_{0},k_{1},\ldots,k_{n-2})+k\sigma_{C},\; z(k_{0},k_{1},\ldots,k_{n-2})+l\sigma_{R} \right] \right\rangle \qquad \text{(4.9)}$

where the “parentheses” < and > enclose the elements accessed in parallel. In what follows we write P_(v,μ) to designate the pilot element of processor No. v at real time clock μ.
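The split of the clock hypercube into a parallel part and a sequential part, equations (4.5) to (4.8), can be illustrated with a short Python sketch. The function names `clock_digits` and `clock_schedule` are assumptions introduced here; the sketch only enumerates clock values and does not compute the pilot indices w and z, which depend on the matrix structure.

```python
def clock_digits(K, p, n):
    """Base-p digits (k_(n-1) ... k_1 k_0) of the clock value K, as in (4.5)."""
    return [(K // p ** t) % p for t in reversed(range(n))]


def clock_schedule(p, n, m):
    """Yield, for each sequential cycle, the M = p**m clock values
    that are handled in parallel during that cycle."""
    M = p ** m
    for seq in range(p ** (n - m)):        # value of (j_(n-1) ... j_m)
        # proc is the value of the m-sub-cube (i_(m-1) ... i_0)
        yield [seq * M + proc for proc in range(M)]


if __name__ == "__main__":
    p, n, m = 2, 3, 1
    for cycle, clocks in enumerate(clock_schedule(p, n, m)):
        print(cycle, [clock_digits(K, p, n) for K in clocks])
```

Each inner list groups the clock values whose low m digits differ, so one list corresponds to one simultaneous dispatch of the M processors; successive lists correspond to successive sequential cycles.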

The CGW Transform

The lowest order base-p Chrestenson Generalized Walsh “core matrix” is the p-point Fourier matrix $W_{p} = \frac{1}{\sqrt{p}} \begin{bmatrix} w^{0} & w^{0} & \cdots & w^{0} \\ w^{0} & w^{1} & \cdots & w^{p-1} \\ \vdots & \vdots & & \vdots \\ w^{0} & w^{p-1} & \cdots & w^{(p-1)^{2}} \end{bmatrix}, \qquad \text{(5.1)}$

where

w=exp(−j2π/p); j=√(−1).  (5.2)

In the following, for simplicity, the scaling factor 1/√p will be dropped. We start by deriving three basic forms of the Chrestenson transform in its three different orderings.

The GWN Transformation Matrix

The GWN transformation matrix W_(N) for N=p^(n) data points is obtained from the Generalized-Walsh core matrix W_(p) by the Kronecker multiplication of W_(p) by itself n times.

W_(N,nat)=W_(p)×W_(p)× . . . ×W_(p) (n times)=W_(p)^([n]).  (5.3)
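As an illustration of (5.3), the following Python sketch builds the natural-order matrix by repeated Kronecker multiplication of the core matrix, with the scaling factor dropped as in the text. The helper names `core_matrix`, `kron` and `gwn_matrix` are assumptions made for this example, and plain nested lists are used so that no external library is needed.

```python
import cmath


def core_matrix(p):
    """p-point core matrix with entries w**(row*col), w = exp(-j*2*pi/p)."""
    w = cmath.exp(-2j * cmath.pi / p)
    return [[w ** (row * col) for col in range(p)] for row in range(p)]


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def gwn_matrix(p, n):
    """Kronecker power of the core matrix: W_p x W_p x ... x W_p (n times)."""
    result = [[1]]
    for _ in range(n):
        result = kron(result, core_matrix(p))
    return result


if __name__ == "__main__":
    # For p = 2 the result is the 4x4 natural-order Walsh-Hadamard matrix.
    for row in gwn_matrix(2, 2):
        print([round(value.real) for value in row])
```

For p=2 the entries are ±1 up to floating-point rounding; for larger p they are powers of the complex root w.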

GWP Transformation Matrix

The Generalized Walsh transform in the GWP order is related to the transform in natural order by a digit-reverse ordering. The general-base digit reverse ordering matrix K_(N)^((p)) can be factored using the general-base perfect shuffle permutation matrix P^((p)) and Kronecker products $K_{N}^{(p)} = \prod_{i=0}^{n-1} \left( P_{p^{(n-i)}}^{(p)} \times I_{p^{i}} \right). \qquad \text{(5.4)}$

Operating on a column vector x of dimension K, the base-p Perfect Shuffle permutation matrix of dimension K×K produces the vector

P_(K)x=[x₀, x_(K/p), x_(2K/p), . . . , x_((p−1)K/p), x₁, x_(K/p+1), . . . , x₂, x_(K/p+2), . . . , x_(K−1)]  (5.5)
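The reordering in (5.5) can be written directly as an index computation. The sketch below is a minimal Python illustration; the function name `perfect_shuffle` is an assumption, and the input is taken to be a list whose length K is a multiple of p.

```python
def perfect_shuffle(x, p):
    """Base-p perfect shuffle of the sequence x, following (5.5):
    [x_0, x_(K/p), x_(2K/p), ..., x_1, x_(K/p+1), ..., x_(K-1)]."""
    K = len(x)
    if K % p != 0:
        raise ValueError("length of x must be a multiple of p")
    stride = K // p
    # Output position a*p + b takes the element at b*stride + a.
    return [x[b * stride + a] for a in range(stride) for b in range(p)]


if __name__ == "__main__":
    print(perfect_shuffle(list(range(6)), 2))
    print(perfect_shuffle(list(range(9)), 3))
```

Passing `list(range(K))` returns the source index of each output position, which is a convenient way to inspect the permutation for a given p and K.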

The GWP matrix W_(N,WP) can thus be written in the form $\begin{matrix}\begin{matrix}{W_{N,{WP}} = {K_{N}^{(p)}{W_{N,{nat}}.}}} \\{= {\prod\limits_{i = 0}^{n - 1}{\left( {P_{p^{({n - i})}}^{(p)} \times I_{p^{i}}} \right){W_{p}^{\lbrack n\rbrack}.}}}}\end{matrix} & \text{(5.6)}\end{matrix}$

GWK Transformation Matrix

The GWK transformation matrix is related to the GWP matrix through a p-ary to Gray transformation matrix G_(N)^((p)).

W_(N,WK)=G_(N) ^((p))W_(N,WP).  (5.7)

The following factorizations lead to shuffle-free optimal parallel-pipelined processors.

A. GWN Factorization

A fixed topology factorization of the GWN transformation matrix has the form $W_{N,nat} = \prod_{i=0}^{n-1} P_{N} C_{N} = \prod_{i=0}^{n-1} P_{N} \left( I_{N/p} \times W_{p} \right). \qquad \text{(5.7)}$

which can be re-written in the form $W_{N,nat} = P \left\{ \prod_{i=0}^{n-1} CP \right\} P^{-1} = P \left\{ \prod_{i=0}^{n-1} F \right\} P^{-1}, \qquad \text{(5.8)}$ with $C = C_{N} = I_{p^{n-1}} \times W_{p} \qquad \text{(5.9)}$

and F=CP. Note that the matrix F is p²-optimal in the sense of “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459.
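The fixed topology factorization of the GWN matrix can be checked numerically for small p and n. The Python sketch below compares the n-fold product of P_(N)(I_(N/p)×W_(p)) with the Kronecker power of W_(p). All helper names are assumptions made for this example, the perfect shuffle matrix is built from the vector reordering in (5.5), and the scaling factor is dropped as in the text.

```python
import cmath


def core_matrix(p):
    w = cmath.exp(-2j * cmath.pi / p)
    return [[w ** (r * c) for c in range(p)] for r in range(p)]


def kron(A, B):
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]


def identity(size):
    return [[1 if r == c else 0 for c in range(size)] for r in range(size)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def shuffle_matrix(N, p):
    """Permutation matrix P with (P x)[a*p + b] = x[b*(N/p) + a]."""
    stride = N // p
    P = [[0] * N for _ in range(N)]
    for a in range(stride):
        for b in range(p):
            P[a * p + b][b * stride + a] = 1
    return P


def gwn_by_factorization(p, n):
    """n-fold product of the single stage P_N (I_(N/p) x W_p)."""
    N = p ** n
    stage = matmul(shuffle_matrix(N, p), kron(identity(N // p), core_matrix(p)))
    result = identity(N)
    for _ in range(n):
        result = matmul(result, stage)
    return result


def gwn_by_kronecker(p, n):
    result = [[1]]
    for _ in range(n):
        result = kron(result, core_matrix(p))
    return result


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


if __name__ == "__main__":
    for p, n in [(2, 3), (3, 2)]:
        print(p, n, max_abs_diff(gwn_by_factorization(p, n),
                                 gwn_by_kronecker(p, n)))
```

Because every stage uses the same matrix, the printed differences being at rounding level is consistent with the fixed topology property: one wiring pattern is reused at each of the n iterations.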

B. GWP Factorization

A fixed topology factorization of the GWP matrix has the form $W_{N,WP} = \prod_{i=0}^{n-1} J_{i} C_{N} \qquad \text{(5.10)}$

$J_{i} = \left( I_{p^{n-i-1}} \times P_{p^{i+1}} \right) = H_{n-i-1} \qquad \text{(5.11)}$

Letting

$Q_{i} = C_{N} J_{i+1} = C_{N} H_{n-i-2};\quad i = 0, 1, \ldots, n-2; \qquad Q_{n-1} = C_{N} \qquad \text{(5.12)}$

we obtain $\begin{matrix}{{W_{N,{WP}} = {\prod\limits_{i = 0}^{n - 1}Q_{i}}},} & (5.13)\end{matrix}$

where each matrix Q_(i); i=0, 1, . . . , n−2, is p²-optimal, while Q_(n−1) is p-optimal.

C. GWK Factorization

The fixed topology GWK factorization has the form $\begin{matrix}{W_{N,{WK}} = {P\left\{ {\prod\limits_{i = 0}^{n - 1}{P^{- 1}H_{i}C_{N}E_{i}}} \right\} {P^{- 1}.}}} & \text{(5.14)}\end{matrix}$

Letting

$H_{i} = I_{p^{i}} \times P_{p^{n-i}}, \qquad E_{i} = I_{p^{i}} \times D^{\prime}_{p^{n-i}} \qquad \text{(5.15)}$

$D^{\prime}_{p^{n}} = \mathrm{quasidiag}\left( I_{p^{n-1}},\, D_{p^{n-1}},\, D^{2}_{p^{n-1}},\, \ldots,\, D^{(p-1)}_{p^{n-1}} \right) \qquad \text{(5.16)}$

$D^{i}_{p^{n-1}} = D^{i}_{p} \times I_{p^{n-2}}$

$D_{p} = \mathrm{diag}\left( w^{0}, w^{-1}, w^{-2}, \ldots, w^{-(p-1)} \right). \qquad \text{(5.17)}$

$\begin{matrix}{{W_{N,{WK}} = {P\left\{ {\prod\limits_{i = 0}^{n - 1}{P^{- 1}H_{i}G_{i}}} \right\} P^{- 1}}},} & \text{(5.18)}\end{matrix}$

where

G_(i)=C_(N)E_(i).  (5.19)

Letting

$S_{i} = P^{-1} H_{i} P = \left( I_{p^{i-1}} \times P_{p^{n-i}} \times I_{p} \right) \qquad \text{(5.20)}$

we have $\begin{matrix}{W_{N,{WK}} = {P^{2}\left\{ {\prod\limits_{i = 0}^{n - 1}{P^{- 1}G_{i}S_{i + 1}}} \right\} P^{- 1}}} & \text{(5.21)}\end{matrix}$

with

S_(n−1)=S_(n)=I_(N).  (5.22)

The factorization can also be re-written in the form $\begin{matrix}{{W_{N,{WK}} = {P\left\{ {\prod\limits_{i = 0}^{n - 1}\Gamma_{i}} \right\} P^{- 1}}},} & \text{(5.23)}\end{matrix}$

where $\Gamma_{i} = P^{-1} G_{i} S_{i+1} = P^{-1} G_{i} \left( I_{p^{i}} \times P_{p^{n-i-1}} \times I_{p} \right);\quad i = 1, 2, \ldots, n-1; \qquad \Gamma_{0} = G_{0} S_{1}. \qquad \text{(5.24)}$

The matrices Γ_(i) are p²-optimal, except for Γ₀ which is maximal span. These are therefore optimal algorithms which can be implemented by an optimal parallel processor, recirculant or pipelined, with no shuffling cycle called for during any of the n iterations.

Image Processing

The potential in enhanced speed of processing of the optimal algorithms is all the more evident within the context of real-time image processing applications. For 2D signals, algorithms of generalized spectral analysis can be applied on sub-images or on successive column-row vectors of the input image. Factorizations of the algorithms of the Chrestenson transform applied on an N×N points matrix X representing an image, with N=p^(n), can be written for the different transform matrices. The GWN 2D transformation for optimal pipelined architecture can be written in the form $\begin{aligned} Y_{nat} &= P \left\{ \prod_{i=0}^{n-1} F \right\} P^{-1} \times \left[ P \left\{ \prod_{i=0}^{n-1} F \right\} P^{-1} \right]^{T} \\ &= P \left\{ \prod_{i=0}^{n-1} F \right\} P^{-1} \times P \left\{ \prod_{i=0}^{n-1} F \right\} P^{-1}, \end{aligned} \qquad \text{(6.1)}$

where T stands for transpose. The GWP factorization can be written in the form $\begin{aligned} Y_{WP} &= \prod_{i=0}^{n-1} Q_{i} \times \left( \prod_{i=0}^{n-1} Q_{i} \right)^{T} \\ &= \prod_{i=0}^{n-1} Q_{i} \times \prod_{i=0}^{n-1} Q_{n-i-1}^{T}, \end{aligned} \qquad \text{(6.2)}$ with $Q_{i}^{T} = C_{N} \left( I_{p^{n-i-1}} \times P_{p^{i+1}}^{-1} \right). \qquad \text{(6.3)}$

The GWK factorization for optimal pipelined architecture can be written in the form $\begin{aligned} Y_{WK} &= P^{2} \left\{ \prod_{i=0}^{n-1} \Gamma_{i} \right\} P \times \left[ P^{2} \left\{ \prod_{i=0}^{n-1} \Gamma_{i} \right\} P \right]^{T} \\ &= P^{2} \left\{ \prod_{i=0}^{n-1} \Gamma_{i} \right\} P \times P^{-1} \left\{ \prod_{i=0}^{n-1} \Gamma_{n-i-1}^{T} \right\} P^{-2}, \end{aligned} \qquad \text{(6.4)}$ with $\Gamma_{i}^{T} = \left( I_{p^{i}} \times P_{p^{n-i-1}}^{-1} \times I_{p} \right) G_{i}^{-1} P. \qquad \text{(6.5)}$

These fast algorithms are all p²-optimal, requiring no shuffling between iterations of a pipelined processor. In applying these factorizations the successive iterations are effected on successive sub-images such that after log_(p) N stages the transform image Y is pipelined at the processor output. Applications include real-time processing of video signals.

The Fourier transform is but a special case of the Chrestenson Generalized Walsh transform. The Fourier matrix for N points is the matrix F_(N) defined above in (1) with p replaced by N: $\begin{matrix}{F_{N} = \begin{bmatrix}w^{0} & w^{0} & \cdots & w^{0} \\w^{0} & w^{1} & \cdots & w^{N - 1} \\w^{0} & w^{2} & \cdots & w^{2{({N - 1})}} \\\vdots & \vdots & & \vdots \\w^{0} & w^{N - 1} & \cdots & w^{{({N - 1})}^{2}}\end{bmatrix}} & (6.6)\end{matrix}$
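As a concrete reading of (6.6), the following minimal sketch builds F_(N) entry by entry. The function name `fourier_matrix` is hypothetical, and the choice w = exp(−2πi/N) is an assumption: w is defined in an earlier part of the document that is not reproduced here, and the opposite sign convention would serve equally well for this check.

```python
import cmath

def fourier_matrix(N):
    """Return the N x N matrix whose (r, c) entry is w**(r*c), as in (6.6)."""
    # Assumption: w = exp(-2*pi*i/N); the sign convention is not fixed by this section.
    w = cmath.exp(-2j * cmath.pi / N)
    return [[w ** (r * c) for c in range(N)] for r in range(N)]

F4 = fourier_matrix(4)

# Sanity check: the rows are mutually orthogonal, so F times its
# conjugate transpose equals N times the identity.
gram = [[sum(F4[r][k] * F4[s][k].conjugate() for k in range(4))
         for s in range(4)]
        for r in range(4)]
```

The orthogonality check holds for either sign convention of w, which is why it is used here instead of comparing individual entries.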

For images the factorization leads to the optimal form $\begin{matrix}{Y_{F} = {\left\{ {\prod\limits_{i = 0}^{n - 1}F_{i}} \right\} \times \left\{ {\prod\limits_{k = 0}^{n - 1}F_{n - k - 1}} \right\}}} & \text{(6.7)}\end{matrix}$

and for unidimensional signals the corresponding form for the Fourier matrix is $\begin{matrix}{F_{N} = {\prod\limits_{i = 0}^{n - 1}\left( F_{i} \right)}} & \text{(6.8)}\end{matrix}$

 F_(i)=U_(i)C_(i)

C_(i)=CJ_(i+1); i=0,1, . . . ,n−1

C_(n−1)=C  (6.9)

U₁=I_(N)

U_(i)=I_(p^(n−i−1))×D_(p^(i+1))=I_(p^(n−i−1))×D_(N/p^(n−i−1))

D_(N/m)=diag(I_(N/(pm)),K_(m),K_(m)^(2), . . . ,K_(m)^(p−1))

K_(t)=diag(w⁰,w^(t), . . . ,w^([N/(mp)−1]t))  (6.10)

Perfect Shuffle Hypercube Transformations

The hypercube transformations approach is illustrated using the important matrices of the Chrestenson Generalized Walsh-Paley (CGWP), Generalized Walsh-Kaczmarz (CGWK) and Fourier transforms.

We note that the matrices C_(k) in the Fourier transform expansion are closely related to the matrices J_(i) and H_(i) in the Chrestenson Generalized Walsh-Paley factorization. In fact the following relations are readily established:

C_(N)≜C

C_(i)=CJ_(i+1)=CH_(n−i−2)=Q_(i)  (7.1)

Q_(n−1)=C_(n−1)=C  (7.2)

Therefore, the CGWP matrices Q_(i) are the same as the C_(i) matrices and have the same structure as the F_(i) matrices in the Fourier matrix factorization. Writing

B_(k)=CH_(k)  (7.3)

H_(k)=I_(p^(k))×P_(p^(n−k))  (7.4)

the post-multiplication by H_(k) has the effect of permuting the columns of C so that at row w,

w≅(0j_(n−2) . . . j₁j₀)  (7.5)

the pilot element is at column z as determined by the permutation H_(k), that is,

z≅(j_(k)0j_(n−2) . . . j_(k+1)j_(k−1) . . . j₁j₀)  (7.6)

with the special case k=n−2 producing

z≅(j_(n−2)0j_(n−3) . . . j₁j₀)  (7.7)

and that of k=n−1 yielding

z≅(0j_(n−2) . . . j₁j₀)  (7.8)

Alternatively, we can write z directly as a function of w by using previously developed expressions of permutation matrices in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459. For example,

B₀=CH₀=CP  (7.9)

and using the expression defining P, namely, $\begin{matrix}{\left\lbrack P_{p^{n}}^{k} \right\rbrack_{uv} = \left\{ \begin{matrix}{1;} & {u = 0,1,\cdots,p^{n} - 1;\quad v = \left\lbrack u + \left( u\ \mathrm{mod}\ p^{k} \right)\left( p^{n} - 1 \right) \right\rbrack/p^{k}} \\{0;} & \text{otherwise;}\end{matrix} \right.\quad k = 0,1,\cdots,N - 1,} & (7.10)\end{matrix}$

with k=1, we can write

z=[w+(w mod p)(p^(n)−1)]/p  (7.11)

a relation that defines the pilot elements matrix.
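The index relation (7.10), of which (7.11) is the case k=1, can be evaluated directly with integer arithmetic. The following is a minimal sketch under stated assumptions: the names `shuffle_index` and `to_digits` are hypothetical helpers introduced only for illustration. It shows that the mapping moves the k least significant base-p digits of the index to the most significant positions, that is, a circular right rotation of the digits.

```python
def shuffle_index(u, p, n, k=1):
    """Column v holding the 1 in row u of P^k for N = p**n, per (7.10)."""
    r = u % p**k
    # u - r is a multiple of p**k and so is r * p**n, hence the division is exact.
    return (u + r * (p**n - 1)) // p**k

def to_digits(x, p, n):
    """Base-p digits of x, most significant digit first."""
    return [(x // p**t) % p for t in range(n - 1, -1, -1)]

p, n = 3, 3
w = 5
z = shuffle_index(w, p, n)          # (7.11): k = 1
print(to_digits(w, p, n), "->", to_digits(z, p, n))   # [0, 1, 2] -> [2, 0, 1]
```

Applying the function to every index 0, 1, . . . , p^(n)−1 yields each value exactly once, as expected of a permutation.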

Similarly,

B₁=CH₁=C(I_(p)×P_(p^(n−1)))  (7.12)

and from the definition given in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459: $\begin{matrix}{\left\lbrack P_{i}^{t} \right\rbrack_{uv} = \left\{ \begin{matrix}{1;} & {u = 0,1,\cdots,p^{n} - 1;} \\ & {v = p^{i - t\,\mathrm{mod}\,(n - i)}\left\lbrack p^{- i}\left( u - u\ \mathrm{mod}\ p^{i} \right) + \left\{ \left\lbrack p^{- i}\left( u - u\ \mathrm{mod}\ p^{i} \right) \right\rbrack\ \mathrm{mod}\ p^{t\,\mathrm{mod}\,(n - i)} \right\}\left( p^{n - i} - 1 \right) \right\rbrack + u\ \mathrm{mod}\ p^{i};} \\{0;} & \text{otherwise;}\end{matrix} \right.} & (7.13)\end{matrix}$

with i=1 and t=1 we have

z=[p⁻¹(w−w mod p)+{[p⁻¹(w−w mod p)]mod p}(p^(n−1)−1)]+w mod p.  (7.14)
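Relation (7.13), and its special case (7.14), can be checked numerically. The sketch below is an illustration only; the function name `partial_shuffle_index` is hypothetical. It evaluates (7.13) with integer arithmetic: the i least significant base-p digits of the index are kept in place while the upper n−i digits are shuffled t times.

```python
def partial_shuffle_index(u, p, n, i, t=1):
    """Evaluate v in (7.13) for 0 <= i < n using exact integer arithmetic."""
    low = u % p**i                      # u mod p^i, left untouched
    upper = (u - low) // p**i           # p^(-i) (u - u mod p^i)
    tt = t % (n - i)                    # t mod (n - i)
    bracket = upper + (upper % p**tt) * (p**(n - i) - 1)
    # (7.13) multiplies the bracket by p^(i - tt); the bracket is a multiple
    # of p^tt, so dividing first keeps everything in the integers.
    return (bracket // p**tt) * p**i + low

# i = 1, t = 1 is (7.14). With p = 3, n = 4 the index 17 = (0 1 2 2)_3
# maps to 59 = (2 0 1 2)_3: the low digit stays, the upper three rotate.
z = partial_shuffle_index(17, 3, 4, i=1, t=1)
```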

Consider the permutation matrix

R_(N)=R_(p^(n))=I_(p^(m))×P_(p^(j))×I_(p^(k))  (7.15)

Let the base-p hypercube describing the order in a vector x of N=p^(n) elements be represented as the n-tuple

x≅(j_(n−1) . . . j₁j₀)_(p); j_(i)∈{0,1, . . . ,p−1}  (7.16)

The application of the matrix R_(p^(n)) to the n-tuple vector x results in the n-tuple:

v=(j_(n−1) . . . j_(n−k+1)j_(n−k)j_(m)j_(n−k−1) . . .j_(m+2)j_(m+1)j_(m−1) . . . j₁j₀)  (7.17)

We note that with respect to x the left k digits and the right m digits are left unchanged while the remaining digits are rotated using a circular shift of one digit to the right.
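The digit rotation described by (7.15) to (7.17) can be sketched as an operation on a list of digits. This is an illustrative sketch, not part of the original formalism; the function name `rotate_hypercube` is hypothetical, and digits are listed most significant first so that the output can be read against (7.17).

```python
def rotate_hypercube(digits, m, j, k):
    """Apply R = I_(p^m) x P_(p^j) x I_(p^k) to an n-tuple, n = m + j + k.

    digits is ordered most significant first. The left k digits and the
    right m digits are kept; the middle j digits are rotated one place
    to the right, circularly.
    """
    if len(digits) != m + j + k:
        raise ValueError("expected m + j + k digits")
    left, middle, right = digits[:k], digits[k:k + j], digits[k + j:]
    return left + middle[-1:] + middle[:-1] + right

# Symbolic check with n = 6, m = 2, j = 3, k = 1.
x = ["j5", "j4", "j3", "j2", "j1", "j0"]
v = rotate_hypercube(x, m=2, j=3, k=1)
print(v)   # ['j5', 'j2', 'j4', 'j3', 'j1', 'j0']
```

The printed tuple agrees with (7.17): j_(m)=j₂ moves to the head of the rotated group, directly after the k unchanged digits.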

The pilot-elements matrix β_(k) corresponding to the matrix B_(k) is obtained by restricting the values of w (and hence the corresponding z values) to w=0, 1, . . . , p^(n−1)−1.

Moreover, we note that if we write

L_(i)=P⁻¹G_(i)=P^(n−1)G_(i)  (7.18)

and note that G_(i) is similar in structure to C_(N), we have

z=[w+(w mod p^(k))(p^(n)−1)]/p^(k)  (7.19)

with k=n−1.

To obtain the pilot elements matrix λ_(i) corresponding to L_(i) we write

z′=z mod p^(n−1)  (7.20)

in order to reveal all satellite elements accompanying each pilot element. We then eliminate all the repeated entries in z′ and the corresponding w values, retaining only pilot element positions. Alternatively we simply force to zero the digit of weight n−2 in w and that of weight n−1 in z.
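The shortcut stated above (forcing the digit of weight n−2 in w to zero) can be checked against (7.19). The following sketch is an illustration under stated assumptions; the function name `pilots_of_L` is hypothetical. It keeps only the rows w whose digit of weight n−2 is zero and evaluates z from (7.19) with k=n−1.

```python
def pilots_of_L(p, n):
    """(w, z) pilot pairs of L_i = P^(-1) G_i, using (7.19) with k = n - 1."""
    k = n - 1
    pairs = []
    for w in range(p**n):
        if (w // p**(n - 2)) % p != 0:
            continue                      # digit of weight n-2 in w forced to zero
        z = (w + (w % p**k) * (p**n - 1)) // p**k
        pairs.append((w, z))
    return pairs

pairs = pilots_of_L(3, 3)
# Every retained z is below p^(n-1), i.e. its digit of weight n-1 is zero,
# and the z values are all distinct.
z_values = sorted(z for _, z in pairs)
```

For p=3 and n=3 the nine retained pairs have z values 0 to 8, each occurring once, which is consistent with the elimination of repeated z′ entries described in the text.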

The CGWP Factorization

We presently focus our attention on the matrices

 B_(k)=CH_(k); k=0,1, . . . ,n−1  (8.1)

In evaluating the pilot elements coordinates we begin by setting the number of processors M=1. The corresponding w-z relation of the pilot elements is thus evaluated with m=0. Once this relation has been established it is subsequently used as the reference “w-z conversion template” to produce the pilot element positions for a general number of M=p^(m) processors. A “right” scan is applied to the matrix in order to produce the w-z template with an ascending order of w. In this type of scan the algorithm advances the first index w from zero, selecting pilot elements by evaluating their displacement to the right as the second index z. Once the template has been evaluated, the value m corresponding to the number of processors to be dispatched is used to perform successive p-ary divisions in proportion to m to assign the M processors with maximum spacing, leading to maximum possible lengths of memory queues. A “down” scan is subsequently applied, where p-ary divisions are applied successively while proceeding downward along the matrix columns, followed by a selection of the desired optimal scan.

The template evaluation and subsequent p-ary divisions for the assignment of the M processors through a right type scan produce the following hypercube assignments. The assignments are, as expected, functions of the four variables n, p, k and m. The conditions of validity of the different assignments are denoted by numbers and letters for subsequent referencing. With K denoting the main clock, the following hypercube transformations are obtained

K≅(j_(n−1) . . . j_(m+1)j_(m)i_(m−1) . . . i₁i₀)_(p)

K_({overscore (n−1)})≅(0j_(n−2) . . . j_(m+1)j_(m)i_(m−1) . . . i₁i₀)_(p)

K_({overscore (n−2)})≅(j_(n−1)0j_(n−3) . . . j_(m+1)j_(m)i_(m−1) . . . i₁i₀)_(p)  (8.2)

I. k<n−2

(1) x: m=0

w≅K_({overscore (n−1)})  (8.3)

z≅[(I_(p^(k))×P_(p^(n−k)))K]_({overscore (n−2)})  (8.4)

(2) y: 1≦m≦n−k−2 $\begin{matrix}{w \simeq \left\lbrack {\left( {P_{p^{k + 1}} \times I_{p^{n - k - 1}}} \right){\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 1}}} & \text{(8.5)} \\{z \simeq \left\lbrack {P_{p^{n}}{\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 2}}} & \text{(8.6)}\end{matrix}$

(3) z: n−k−1≦m≦n−1 $\begin{matrix}{w \simeq \left\lbrack {\left( {P_{p^{k + 1}} \times I_{p^{n - k - 1}}} \right){\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 1}}} & \text{(8.7)} \\{z \simeq \left\lbrack {P_{p^{n}}{\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 2}}} & \text{(8.8)}\end{matrix}$

II. k=n−2

(1) u: m=0

w≅K_({overscore (n−1)})

z≅[(I_(p^(n−2))×P_(p^(2)))K]_({overscore (n−2)})  (8.9)

(2) v: m≧1 $\begin{matrix}{w \simeq \left\lbrack {\prod\limits_{t = 0}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}} \right\rbrack_{\overset{\_}{n - 1}}} & \text{(8.10)} \\{z \simeq \left\lbrack {P_{p^{n}}{\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 2}}} & \text{(8.11)}\end{matrix}$

t: k=n−1 $\begin{matrix}{w = {z \simeq \left\lbrack {\prod\limits_{t = 0}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}} \right\rbrack_{\overset{\_}{n - 1}}}} & \text{(8.12)}\end{matrix}$

Evaluated, these hypercubes yield the following pilot element assignments:

x: (k<n−2, m=0) $\begin{matrix}{w = {\sum\limits_{t = 0}^{n - 2}{p^{t}j_{t}}}} & \text{(8.13)} \\{z = {{\sum\limits_{t = 0}^{k - 1}{p^{t}j_{t}}} + {p^{n - 1}j_{k}} + {\sum\limits_{t = {k + 1}}^{n - 2}{p^{t - 1}j_{t}}}}} & \text{(8.14)}\end{matrix}$

y: k<n−2, 1≦m≦n−k−2 $\begin{matrix}{w = {{p^{k}i_{0}} + {\sum\limits_{s = 1}^{m - 1}{p^{n - 1 - s}i_{s}}} + {\sum\limits_{t = m}^{m + k - 1}{p^{t - m}j_{t}}} + {\sum\limits_{t = {m + k}}^{n - 2}{p^{t - m + 1}j_{t}}}}} & \text{(8.15)} \\{z = {{p^{n - 1}i_{0}} + {\sum\limits_{s = 1}^{m - 1}{p^{n - 2 - s}i_{s}}} + {\sum\limits_{t = m}^{n - 2}{p^{t - m}j_{t}}}}} & \text{(8.16)}\end{matrix}$

z: k<n−2, n−k−1≦m≦n−1 $\begin{matrix}{w = {{p^{k}i_{0}} + {\sum\limits_{s = 1}^{n - k - 2}{p^{n - 1 - s}i_{s}}} + {\sum\limits_{s = {n - k - 1}}^{m - 1}{p^{n - 2 - s}i_{s}}} + {\sum\limits_{t = m}^{n - 2}{p^{t - m}j_{t}}}}} & \text{(8.17)} \\{z = {{p^{n - 1}i_{0}} + {\sum\limits_{s = 1}^{m - 1}{p^{n - 2 - s}i_{s}}} + {\sum\limits_{t = m}^{n - 2}{p^{t - m}j_{t}}}}} & \text{(8.18)}\end{matrix}$

u: k=n−2, m=0 $\begin{matrix}{w = {\sum\limits_{t = 0}^{n - 2}{p^{t}j_{t}}}} & \text{(8.19)} \\{z = {{\sum\limits_{t = 0}^{n - 3}{p^{t}j_{t}}} + {p^{n - 1}j_{n - 2}}}} & \text{(8.20)}\end{matrix}$

v: k=n−2, m≧1 $\begin{matrix}{w = {{\sum\limits_{s = 0}^{m - 1}{p^{k - s}i_{s}}} + {\sum\limits_{t = m}^{n - 2}{p^{t - m}j_{t}}}}} & \text{(8.21)} \\{z = {{p^{n - 1}i_{0}} + {\sum\limits_{s = 1}^{m - 1}{p^{k - s}i_{s}}} + {\sum\limits_{t = m}^{n - 2}{p^{t - m}j_{t}}}}} & \text{(8.22)}\end{matrix}$

t: k=n−1 $\begin{matrix}{w = {z = {{\sum\limits_{s = 0}^{m - 1}{p^{n - 2 - s}i_{s}}} + {\sum\limits_{t = m}^{n - 2}{p^{t - m}j_{t}}}}}} & (8.23)\end{matrix}$
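As an illustration of how these sums are evaluated, the sketch below computes the m=0 template of conditions x and u, that is, (8.13) with (8.14), which reduces to (8.20) when k=n−2. The function name `template_x` is hypothetical and the code is only a sketch under these assumptions. For k=0 the result can be compared with the closed form (7.11).

```python
def template_x(p, n, k):
    """(w, z) pilot pairs of B_k for m = 0 and 0 <= k <= n - 2."""
    if not 0 <= k <= n - 2:
        raise ValueError("this sketch covers 0 <= k <= n - 2 only")
    pairs = []
    for w in range(p**(n - 1)):                       # (8.13): digit of weight n-1 is zero
        j = [(w // p**t) % p for t in range(n - 1)]   # j[t] is the digit j_t of w
        z = (sum(p**t * j[t] for t in range(k))       # digits below j_k keep their weight
             + p**(n - 1) * j[k]                      # j_k moves to the top position
             + sum(p**(t - 1) * j[t] for t in range(k + 1, n - 1)))  # digits above drop one place
        pairs.append((w, z))
    return pairs

pairs_k0 = template_x(3, 3, 0)
pairs_k1 = template_x(3, 4, 1)
```

With p=3, n=3 and k=0 the pair for w=5 is (5, 19), the same value that (7.11) gives.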

Optimal Assignment

A processor is considered optimal if it requires a minimum of memory partitions, is shuffle free (no clock times are used solely for shuffling), and produces an ordered output given an ordered input; see “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459. It is shown there that p²-optimal algorithms and processors lead to a minimum number of p² partitions of N/p² queue length each. With M=p^(m) base-p processors operating in parallel the number of partitions increases to p^(m+2) and the queue length of each partition reduces to N/p^(m+2).

An optimal multiprocessing algorithm should satisfy such optimality constraints. The horizontal spacing between simultaneously accessed pilot elements defines the input memory queue length. The vertical spacing defines the output memory queue length. With M processors applied in parallel the horizontal spacing between the accessed elements will be referred to as the “input pitch”, and the vertical spacing as the “output pitch”.

By choosing the pilot elements leading to the maximum possible pitch, which is the highest of the two values: the minimum input pitch and minimum output pitch, optimality in the form of N/p^(m+2) queue length is achieved.

We note that the optimal minimum memory queue length MMQL satisfies ${MMQL} = \left\{ \begin{matrix}{p^{n - m - 2};} & {m \leq {n - 2}} \\{1;} & {m = {n - 1}}\end{matrix} \right.$
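The piecewise expression for MMQL translates directly into a few lines of code. The sketch below is only an illustration; the function name `mmql` is hypothetical.

```python
def mmql(p, n, m):
    """Optimal minimum memory queue length for M = p**m processors, N = p**n."""
    if not 0 <= m <= n - 1:
        raise ValueError("m must lie between 0 and n - 1")
    return p**(n - m - 2) if m <= n - 2 else 1

# N = 729 = 3^6 with M = 9 processors (m = 2), and with M = 243 (m = 5).
q9 = mmql(3, 6, 2)
q243 = mmql(3, 6, 5)
print(q9, q243)   # 9 1
```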

The following algorithm, Algorithm 2, describes this approach to state assignment optimality.

Algorithm 2: Optimality search

begin

Extract pilots matrix

Apply right scan

Evaluate input pitch

Evaluate output pitch

p_(i,min)=min[input pitch]

p_(o,min)=min[output pitch]

p_(r,min)=min[p_(i,min), p_(o,min)]

Apply down scan

Evaluate input pitch

Evaluate output pitch

p_(i,min)=min[input pitch]

p_(o,min)=min[output pitch]

p_(d,min)=min[p_(i,min), p_(o,min)]

Optimal pitch=max[p_(d,min), p_(r,min)]

If p_(r,min)≧p_(d,min) then optimal=right scan Else optimal=down scan

Apply hypercube transformations

Dispatch and sequence M processors

end
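The comparison at the heart of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the definitive implementation: the names `min_pitch` and `choose_scan` are hypothetical, each scan is represented simply by the w and z addresses accessed in one clock, and at least two processors (M≥2) are assumed so that a spacing exists. The sample data takes p=3, n=6, k=1, m=2 at clock zero (all j digits zero), with the right-scan addresses from (8.15) and (8.16) and the down-scan addresses from the down-scan hypercubes given for condition y.

```python
def min_pitch(addresses):
    """Smallest spacing between any two addresses accessed in the same clock."""
    ordered = sorted(addresses)
    return min(b - a for a, b in zip(ordered, ordered[1:]))

def choose_scan(right, down):
    """Each argument is a (w_addresses, z_addresses) pair for one clock.

    Returns the selected scan and its minimum pitch, following the final
    comparison of Algorithm 2 (right scan wins ties).
    """
    p_r_min = min(min_pitch(right[0]), min_pitch(right[1]))
    p_d_min = min(min_pitch(down[0]), min_pitch(down[1]))
    if p_r_min >= p_d_min:
        return "right", p_r_min
    return "down", p_d_min

p = 3
digits = [(i0, i1) for i1 in range(p) for i0 in range(p)]
right = ([3 * i0 + 81 * i1 for i0, i1 in digits],     # w from (8.15)
         [243 * i0 + 27 * i1 for i0, i1 in digits])   # z from (8.16)
down = ([81 * i0 + 27 * i1 for i0, i1 in digits],     # w: 0 i0 i1 j4 j3 j2
        [27 * i0 + 9 * i1 for i0, i1 in digits])      # z: j3 0 i0 i1 j4 j2
best = choose_scan(right, down)
print(best)   # ('down', 9)
```

For this data the right scan is limited by a pitch of p^(k)=3 in w, whereas the down scan reaches p^(n−m−2)=9, so the down scan is selected.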

In following the algorithm we note that in the validity condition y of the B_(k) matrix, y: 1≦m≦n−k−2, the results obtained are such that the digit i₀ of w is of a weight p^(k). Hence the input pitch is p^(k), while the output pitch, which can be deduced from the position of i₀ in z, is p^(n−1), that is, the maximum possible. The input pitch is thus a function of k and can be low if k is small. By performing a down scan of B_(k) we obtain the following solution:

k<n−2

y: 1≦m≦n−k−2

w: 0 i₀ i₁ . . . i_(m−1) j_(n−2) . . . j_(m+1) j_(m)

z: j_(m+k) 0 i₀ i₁ . . . i_(m−1) j_(n−2) . . . j_(m+k+1) j_(m+k−1) . . .j_(m+1) j_(m)

where now it is i_(m−1) that leads to a minimum pitch and it has a weight of p^(n−m−1) in w and p^(n−m−2) in z. We deduce that the minimum pitch in this solution is p^(n−m−2), which is the optimal sought. The same reasoning leads to the optimal assignment for the case

k<n−2,

z: n−k−1≦m≦n−1

w: 0 i₀ i₁ . . . i_(m−1) j_(n−2) . . . j_(m+1) j_(m)

z: i_(n−2−k) 0 i₀ i₁ . . . i_(n−3−k) i_(n−1−k) i_(n−k) . . . i_(m−1) j_(n−2) . . . j_(m+1) j_(m)

These are the only two cases of the matrix that need be thus modified for optimality. All results obtained above for the other validity conditions can be verified to be optimal.
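The two down-scan hypercubes above can be turned into address sequences by enumerating the main clock K. The sketch below is an illustration under stated assumptions: the names `value` and `down_scan_pairs` are hypothetical, and the enumeration order (i₀ varying fastest, then i₁, and so on up to j_(n−2)) is an assumption taken from the digit order of K in (8.2). The leading values it produces for (p, n, k, m)=(3, 6, 1, 2) and (3, 6, 2, 5) can be compared with Examples 10.2 and 10.3.

```python
def value(digits, p):
    """Integer whose base-p digits are given most significant first."""
    v = 0
    for d in digits:
        v = v * p + d
    return v

def down_scan_pairs(p, n, k, m):
    """(w, z) pilot pairs of B_k under the down scan, conditions y and z."""
    if not (0 <= k < n - 2 and 1 <= m <= n - 1):
        raise ValueError("requires k < n - 2 and 1 <= m <= n - 1")
    pairs = []
    for c in range(p**(n - 1)):
        d = [(c // p**t) % p for t in range(n - 1)]
        i = d[:m]                        # i_0 .. i_(m-1)
        j_desc = d[m:][::-1]             # j_(n-2) .. j_m
        w_digits = [0] + i + j_desc      # w: 0 i0 .. i_(m-1) j_(n-2) .. j_m
        if m <= n - k - 2:               # condition y: j_(m+k) moves to the top of z
            rest = [d[t] for t in range(n - 2, m - 1, -1) if t != m + k]
            z_digits = [d[m + k], 0] + i + rest
        else:                            # condition z: i_(n-2-k) moves to the top of z
            q = n - 2 - k
            z_digits = [i[q], 0] + i[:q] + i[q + 1:] + j_desc
        pairs.append((value(w_digits, p), value(z_digits, p)))
    return pairs

pairs_y = down_scan_pairs(3, 6, 1, 2)    # parameters of Example 10.2
pairs_z = down_scan_pairs(3, 6, 2, 5)    # parameters of Example 10.3
```

In both cases the first M entries correspond to the pilot elements dispatched at clock zero.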

Matrix Span

In the above, the value of k is incremented from one iteration to the next. In each iteration, once the pilot element matrix coordinates (w, z) are determined as shown above, each processor accesses p elements spaced by the row span, starting with the pilot element, and writes its p outputs at addresses spaced by the column span. The row and column spans of a matrix are evaluated as shown in “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459. In particular we note that the matrix

B_(k)=CH_(k)  (9.1)

has the same column span as that of C, namely σ_(c)(B_(k))=σ_(c)(C)=p^(n−1). The row span of B_(k) is evaluated by noticing that B_(k) has the same structure as C with its columns permuted in accordance with the order implied by

H_(k)⁻¹=I_(p^(k))×P_(p^(n−k))⁻¹  (9.2)

The transformation of the hypercube (i_(n−1) . . . i₁i₀) corresponding to H_(k)⁻¹ is one leading to a most significant digit equal to i_(n−2). Since this digit changes value from 0 to 1 in a cycle length of p^(n−2) we deduce that the row span of all the B_(k) matrices is simply

σ_(R)(B_(k))=p^(n−2)  (9.3)

Each processing element thus accesses p operands spaced p^(n−2) points apart and writes their p outputs at points which are p^(n−1) points apart.
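The addressing implied by these spans can be sketched as below. This is an illustration under stated assumptions: the function name `operand_addresses` is hypothetical, and it is assumed that the operands are read starting at the pilot column z and that the outputs are written starting at the pilot row w.

```python
def operand_addresses(w_pilot, z_pilot, p, n):
    """Read and write addresses of one processing element for a B_k matrix."""
    row_span = p**(n - 2)       # (9.3)
    col_span = p**(n - 1)       # column span of B_k, same as that of C
    reads = [z_pilot + s * row_span for s in range(p)]    # assumed: reads start at column z
    writes = [w_pilot + s * col_span for s in range(p)]   # assumed: writes start at row w
    return reads, writes

# p = 3, n = 3, pilot element at (w, z) = (5, 19).
reads, writes = operand_addresses(5, 19, 3, 3)
print(reads, writes)   # [19, 22, 25] [5, 14, 23]
```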

The CGWK Factorization

The sampling matrices of the GWK factorization are more complex in structure than the other generalized spectral analysis matrices. They are defined by

Γ_(i)=P⁻¹G_(i)S_(i+1)  (11.1)

Let

L_(i)≜P⁻¹G_(i)  (11.2)

we have

Γ_(i)=L_(i)S_(i+1)  (11.3)

We note that the sampling matrix G_(i) has the same structure in poles and zeros, that is, in the positions of non-zero and zero elements respectively, as that of the matrix C_(N); see “Optimal Parallel and Pipelined Processing Through a New Class of Matrices with Application to Generalized Spectral Analysis”, Michael J. Corinthios, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459. We can write for the matrix G_(i)

w_(G_(i))≅(j_(n−2) . . . j₁j₀)

z_(G_(i))≅(j_(n−2) . . . j₁j₀)  (11.4)

as the pilot elements positions.

Given the definition of the matrix L_(i), a hypercube rotation corresponding to the matrix P⁻¹ would yield the w and z values of L_(i) as:

w_(L_(i))≅(j_(n−2)0j_(n−3) . . . j₁j₀)

z_(L_(i))=P⁻¹w_(L_(i))≅(0j_(n−3) . . . j₁j₀j_(n−2))  (11.5)

Alternatively, a z-ordered counterpart can be written as:

z_(L_(i))≅(0j_(n−2) . . . j₁j₀)

w_(L_(i))≅(j₀0j_(n−2) . . . j₂j₁)  (11.6)

Similarly, the matrix Γ₀=G₀S₁ which is obtained from G₀ by permuting its columns according to the order dictated by

S₁⁻¹=P_(p^(n−1))⁻¹×I_(p)  (11.7)

leads to the m=0 template assignment

w_(Γ₀)≅(0j_(n−2) . . . j₁j₀)  (11.8)

z_(Γ₀)=S₁w_(Γ₀)≅(0j₀j_(n−2) . . . j₂j₁)  (11.9)

and a similar z-ordered state assignment counterpart.

For

Γ_(k)=G₀S_(k); k>0  (11.10)

we have

S_(k)⁻¹=I_(p^(k−1))×P_(p^(n−k))⁻¹×I_(p)  (11.11)

which leads to the state template assignment

w_(Γ_(k))≅w_(L_(i))≅(j_(n−2)0j_(n−3) . . . j₁j₀),

z_(Γ_(k))=S_(k+1)z_(L_(i))≅(0j_(k−1)j_(n−3) . . . j_(k+1)j_(k)j_(k−2) . . . j₁j₀j_(n−2)); k>0.  (11.12)

With m made variable, a right scan yields the following expressions for the different validity conditions.

The Γ_(k) Transformations

$\begin{matrix}{{1.\quad k} = 0} & \quad \\\begin{matrix}{{{a:k} = 0},{m = 0}} & {\quad {w \simeq K_{\overset{\_}{n - 1}}}} \\\quad & {\quad {z \simeq {P_{p^{n}}K_{\overset{\_}{n - 1}}} \equiv \left\lbrack {\left( {P_{p^{n - 1}} \times I_{p}} \right)K} \right\rbrack_{\overset{\_}{n - 1}}}}\end{matrix} & (11.13) \\\begin{matrix}{{{b:k} = 0},{m \geq 2}} & {\quad {w \simeq \left\lbrack {\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}} \right\rbrack_{\overset{\_}{n - 1}}}}\end{matrix} & (11.14) \\{\quad {z \simeq \left\lbrack {\prod\limits_{t = 0}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}} \right\rbrack_{\overset{\_}{n - 1}}}} & (11.15) \\{{2.\quad 1} \leq k \leq {n - 3}} & \quad \\\begin{matrix}{{{c:m} = 0}\quad} & {\quad {w \simeq \left\lbrack {\left( {I_{p^{n - 2}} \times P_{p^{2}}} \right)K} \right\rbrack_{\overset{\_}{n - 2}}}}\end{matrix} & (11.16) \\{\quad {z \simeq \left\lbrack {\left( {I_{p^{k}} \times P_{p^{n - k - 1}} \times I_{p}} \right)\left( {P_{p^{n - 1}}^{- 1} \times I_{p}} \right)K} \right\rbrack_{\overset{\_}{n - 1}}}} & (11.17) \\\begin{matrix}{{{d:m} = 1}\quad} & {\quad {w \simeq \left\lbrack {\left( {I_{p^{n - 2}} \times P_{p^{2}}} \right)\left( {P_{p^{k}} \times I_{p^{n - k}}} \right)K} \right\rbrack_{\overset{\_}{n - 2}}}}\end{matrix} & (11.18) \\{\quad {z \simeq \left\lbrack {\left( {I_{p} \times P_{p^{n - 2}} \times I_{p}} \right)\left( {P_{p^{n - 1}}^{- 1} \times I_{p}} \right)K} \right\rbrack_{\overset{\_}{n - 1}}}} & (11.19) \\\begin{matrix}{e:{m \geq 2}} & {\quad {z \simeq \left\lbrack {\left( {P_{p^{n - 1}} \times I_{p}} \right){\prod\limits_{t = 2}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 1}}}}\end{matrix} & (11.20) \\{{\left. 
\alpha \right)\quad m} \geq {n - k}} & \quad \\{\quad {w \simeq \left\lbrack {\left( {P_{p^{k}} \times I_{p^{n - k}}} \right){\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 2}}}} & (11.21) \\\begin{matrix}{{\left. \beta \right)\quad 2} \leq m \leq {n - k}} & {w \simeq \left\lbrack {\left( {P_{p^{k}} \times I_{p^{n - k}}} \right){\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 2}}}\end{matrix} & (11.22) \\{{3.\quad k} \geq {n - 2}} & \quad \\{\quad {w \simeq \left\lbrack {\left( {P_{p^{n - 2}} \times P_{p^{2}}} \right)K} \right\rbrack_{\overset{\_}{n - 2}}}} & (11.23) \\{\quad {z \simeq \left\lbrack {\left( {P_{p^{n - 1}}^{- 1} \times I_{p}} \right)K} \right\rbrack_{\overset{\_}{n - 1}}}} & (11.24) \\{{g:m} = {{1\quad w} \simeq \left\lbrack {\left( {I_{p^{2}} \times P_{p^{n - 2}}} \right)\left( {P_{p^{n - 2}} \times I_{p^{2}}} \right)K} \right\rbrack_{\overset{\_}{n - 2}}}} & (11.25) \\{\quad {z \simeq \left\lbrack {\left( {P_{p^{n - 2}}^{- 1} \times I_{p^{2}}} \right)\left( {P_{p^{n - 1}} \times I_{p}} \right)K} \right\rbrack_{\overset{\_}{n - 1}}}} & (11.26) \\{h:{m \geq {2\quad w} \simeq \left\lbrack {\left( {P_{p^{n - 2}} \times I_{p^{2}}} \right){\prod\limits_{t = 1}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 2}}}} & (11.27) \\{i:{2 \leq m \leq {n - {2\quad z}} \simeq \left\lbrack {\left( {P_{p^{n - 1}} \times I_{p}} \right){\prod\limits_{t = 2}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 1}}}} & (11.28) \\{{j:m} = {{n - {1\quad z}} \simeq \left\lbrack {\left( {P_{p^{n - 1}} \times I_{p}} \right){\prod\limits_{t = 2}^{m - 1}{\left( {I_{p^{t}} \times P_{p^{n - t - 1}} \times I_{p}} \right)K}}} \right\rbrack_{\overset{\_}{n - 1}}}} & 
(11.29)\end{matrix}$

Optimal Assignments

A “down” scan of the Γ_(k) matrix yields optimal assignments for two validity conditions:

1. k=0

a: k=0, m=1

w: 0 i₀ j_(n−2) . . . j₂ j₁

z: 0 j₁ i₀ j_(n−2) . . . j₃ j₂

b: k=0, m≧2

w: 0 i₀ i₁ . . . i_(m−1) j_(n−2) . . . j_(m+1) j_(m)

z: 0 j_(m) i₀ i₁ . . . i_(m−2) i_(m−1) j_(n−2) . . . j_(m+1)

All other assignments generated by the “right” scan are optimal and need not be replaced.

The CGWK Matrix Spans

Using the same approach we deduce the spans of the different CGWK factorization matrices.

We have

σ_(R)(L_(i))=σ_(R)(G_(i))=p^(n−1)  (11.30)

σ_(c)(L_(i))=p^(n−2)  (11.31)

σ_(R)(Γ₀)=p^(n−1)  (11.32)

σ_(c)(Γ₀)=σ_(c)(G₀)=p^(n−1)  (11.33)

and

σ_(R)(Γ_(i))=p^(n−1)  (11.34)

σ_(c)(Γ_(i))=σ_(c)(P⁻¹G_(i))=σ_(c)(L_(i))=p^(n−2)  (11.35)

Example 10.1

With N=16 and M=p^(m) the pilots matrices β_(k,m) for different values of k and m are deduced from the results shown above. The pilot element positions thus evaluated for each β_(k,m), together with the processor dispatched at each position at the appropriate clock, are listed below for some values of k and m. $\begin{matrix}
{\beta_{0,1}:\begin{bmatrix}
P_{00} & & & & & & & & & \\
 & & & & & P_{01} & & & & \\
 & P_{02} & & & & & & & & \\
 & & & & & & P_{03} & & & \\
 & & P_{10} & & & & & & & \\
 & & & & & & & P_{11} & & \\
 & & & P_{12} & & & & & & \\
 & & & & & & & & P_{13} &
\end{bmatrix}} \\
{\beta_{2,3}:\begin{bmatrix}
P_{00} & & & & & & & & & \\
 & P_{40} & & & & & & & & \\
 & & P_{20} & & & & & & & \\
 & & & P_{60} & & & & & & \\
 & & & & & P_{10} & & & & \\
 & & & & & & P_{50} & & & \\
 & & & & & & & P_{30} & & \\
 & & & & & & & & P_{70} &
\end{bmatrix}} \\
{\beta_{3,2}:\begin{bmatrix}
P_{00} & & & & & & & \\
 & P_{01} & & & & & & \\
 & & P_{20} & & & & & \\
 & & & P_{21} & & & & \\
 & & & & P_{10} & & & \\
 & & & & & P_{11} & & \\
 & & & & & & P_{30} & \\
 & & & & & & & P_{31}
\end{bmatrix}}
\end{matrix}$

Example 10.2

For the matrix B_(k) with k=1, N=729 and M=9 we have

w={0, 81, 162, 27, 108, 189, 54, 135, 216, 1, 82, 163, 28, . . . , 2, 83, 164, . . . , 218, 3, 84, 165, . . . , 18, 99, 180, . . . }

z={0, 27, 54, 9, 36, 63, 18, 45, 72, 1, 28, 55, 10, . . . , 2, 29, 56, . . . , 74, 243, 270, 297, . . . , 6, 33, 60, . . . }

Nine elements are dispatched in one real time clock. The minimum memory queue length MMQL=minimum pitch=9=3^(n−2−m), confirming the optimality of the state assignment.

Example 10.3

For the matrix B_(k) with k=2, N=729 and M=243 processors we have

w={0, 81, 162, 27, 108, 189, 54, 135, 216, 9, 90, 171, 36, . . . , 18, 99, 180, . . . , 3, 84, 165, . . . , 6, 87, 168, . . . , 1, 82, 163, . . . , 2, 83, 164, . . . }

z={0, 27, 54, 9, 36, 63, 18, 45, 72, 243, 270, 297, 252, . . . , 486, 513, 540, . . . , 3, 30, 57, . . . , 6, 33, 60, . . . , 1, 28, 55, . . . , 2, 29, 56, . . . }

MMQL=1. We note that if M=81 we obtain the same w and z values, but here 81 pilot elements are dispatched in one clock rather than 243 as is the case for m=5. With m=4 the MMQL=3.

Example 10.4

For the matrix Γ_(k) with k=3, N=729 and M=3, the “right” scan, which emphasizes scanning the upper rows before performing p-ary division from the top down, yields with the above Γ_(k) results

w={0, 9, 18, 1, 10, 19, 2, 11, 20, . . . , 8, 17, 26, 27, 36, 45, 54, 63, 72, . . . , 57, 66, 165, . . . , 243, 252, 261, 244, 253, . . . }

z={0, 81, 162, 3, 84, 165, 6, 87, 168, . . . , 24, 105, 186, 27, 108, 189, 54, 135, 216, . . . , 141, 222, 403, . . . , 1, 82, 163, 4, 85, . . . }

We note that:

MMQL=minimum pitch=9

With m=1 the optimal memory queue length=27. Using a “down” scan, applying a p-ary division from top down, we obtain the optimal assignment by a simple shuffle of the above values:

w={0, 27, 54, 1, 28, 55, . . . , 8, 35, 62, 9, 36, 63, 10, 37, 56, . . .}

z={0, 27, 54, 3, 30, 57, 6, 33, 60, 9, . . . , 24, 51, 78, 81, 108, 135,84, 111, 138, . . . }

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1: The initial dispatching of processors indicated by numbers 1, 2, 3, affixed next to the assigned pilot elements at clock zero for the case N=27, p=3, n=3 and M=3 of the optimal factorizations of matrices Q_(k), with k=0.

FIG. 2: The initial dispatching of processors indicated by numbers 1, 2, 3, affixed next to the assigned pilot elements at clock zero for the case N=27, p=3, n=3 and M=3 of the optimal factorizations of matrices Q_(k), with k=1.

FIG. 3: The initial dispatching of processors indicated by numbers 1, 2, 3, affixed next to the assigned pilot elements at clock zero for the case N=27, p=3, n=3 and M=3 of the optimal factorizations of matrices Q_(k), with k=2.

FIG. 4: The initial dispatching for the optimal factorization of matrix Γ_(k) with k=2, where the processing elements are represented by circles and those selected at clock zero are shown with the numbers 1, 2 and 3 affixed next to them.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIGS. 1, 2 and 3 show the initial dispatching of processors indicated by numbers 1, 2, 3, affixed next to the assigned pilot elements at clock zero for the case N=27, p=3, n=3 and M=3 of the optimal factorizations of matrices Q_(k), with k=0, 1 and 2, respectively.

FIG. 4 shows the corresponding dispatching for the optimal factorization of matrix Γ_(k) with k=2, where the processing elements are represented by circles and those selected at clock zero are shown with the numbers 1, 2 and 3 affixed next to them.

It is noted that with larger values of N, as shown in the above examples, the optimal dispatching is obtained by a ‘down’ scan rather than a ‘right’ scan, that is, by following the state assignment algorithm. It is also noted that other state assignments may be applied, but the proposed approach is optimal and any other approach would be either less optimal or at best equivalent to this proposed approach.

A complete solution of state assignment and sequencing, from a single base-p processor to the ultimate massively parallel M=p^(n−1) processors, has been presented. Pilot element addresses and matrix spans to locate their satellites are automatically generated for dispatching and sequencing the parallel processors. Applications are shown on image processing and generalized spectral analysis optimal transforms. The same approach can be directly applied to cases where the matrix is post rather than pre multiplied by shuffle matrices and vice versa. It can also be applied to matrices of general structure and to sub-optimal algorithms, where the span is neither p-optimal nor p²-optimal.

REFERENCES

[1] M. J. Corinthios, “Optimal Parallel and Pipelined Processing Througha New Class of Matrices with Application to Generalized SpectralAnalysis”, IEEE Trans. Comput., Vol. 43, April 1994, pp. 443-459.

[2] M. J. Corinthios, “3-D cellular arrays for parallel/cascadeimage/signal processing”, in Spectral Techniques and Fault Detection, M.Karpovsky, Ed. New York: Academic Press, 1985.

[3] M. J. Corinthios, “The Design of a class of fast Fourier TransformComputers”, IEEE Trans. Comput., Vol. C-20, pp. 617-623, June 1971.

[4] G. Hasteer and P. Banerjee, “A Parallel Algorithm for StateAssignment of Finite State Machines”, IEEE Trans. Comput., vol. 47, No.2, pp. 242-246, February 1998.

[5] O. A. Mc Bryan and E. F. Van De Velde, “Hypercube Algorithms andImplementations”, SIAM J. Sci. Stat. Comput., Vol. 8, No. 2, pp.s227-287, March 1987.

[6] H. S. Stone, “Parallel Processing with the Perfect”, IEEE Trans.Comput. Vol. C-20, No. 2, pp. 153-161, February 1971.

[7] V. Kumar, A. Grama, A. Gupta and G. Karypis, “Introduction toParallel Computing”, Benjamin/Cummings, Redwood, Calif., 1994.

[8] K. E. Batcher, “Design of a Massively Parallel Processor”, IEEETrans. Comput, pp 836-840, September 1980.

[9] H. S. Stone, “High-Performance Computer Architecture”,Addisson-Wesley, Reading, Mass., 1993.

[10] K. Hwang, “Advanced Computer Architecture: Parallelism,Scalability, Programmability”, McGraw Hill, New York, N.Y., 1993.

[11] D. H. Lawrie, “Access and Alignment of Data in an Array Processor”,IEEE Trans. Comput., vol C-24, No. 2, December 1975, pp 1145-1155.

[12] Roziner, T. D., Karpovsky, M. G., and Trachtenberg, L. A., “FastFourier Transforms over Finite Groups by Multiprocessor Systems”, IEEETrans. Accous., Speech, and Sign. Proc., ASSP, vol. 38, No. 2, February1990, pp 226-240.

[13] Taylor, G. F., Steinvorth, R. H., and MacDonald J., “AnArchitecture for a Video Rate Two-Dimensional Fast Fourier Transformprocessor”, IEEE Trans. Comput., vol. 37, No. 9, September 1988, pp1145-1151.

[14] Jou, Y.-Y. and Abraham, J. A., Fault tolerant FFT Networks”, IEEETrans. Comput., vol. 37, No. 5, May 1988, pp. 548-561.

[15] Moraga, Claudio, “Design of Multiple-Valued Systolic System for the Computation of the Chrestenson Spectrum”, IEEE Trans. Comput., Vol. C-35, No. 2, February 1986, pp. 183-188.

[16] Sloate, H., “Matrix Representation for Sorting and the Fast Fourier Transform”, IEEE Trans. Circ. And Syst., Vol. CAS-21, No. 1, January 1974, pp. 109-116.

[17] Galles, Michael B., “Hierarchical Fat Hypercube Architecture for Parallel Processing Systems”, U.S. Pat. No. 5,699,008, September 1997.

[18] Ho, Ching-Tien, “Method for Performing Matrix Transposition on a Mesh Multiprocessor . . . with Concurrent execution of Multiprocessors”, U.S. Pat. No. 5,644,517, July 1997.

[19] Cypher, Robert E., “Hierarchical Network architecture for Parallel Processing Having Interconnection Between Bit-Addressible Modes Based on Address Bit Permutations”, U.S. Pat. No. 5,513,371, April 1996.

[20] Swartztrauber, Paul-Noble, “Multipipeline Multiprocessor System”, U.S. Pat. No. 5,689,722, November 1997.

[21] Kogge, Peter, “Dynamic Multiple Parallel Processing Array”, U.S. Pat. No. 5,475,856, December 1995.

[22] Shyu, Rong-Fuh, “Recycling and Parallel Processing Method . . . for Performing Discrete Cosine Transform and its Inverse”, U.S. Pat. No. 5,471,412, November 1995.

[23] Brantly, Jr., William, C., McAuliffe, K. P., Norton, V. A., Pfister, G. F. and Weiss, J., U.S. Pat. No. 4,980,822, December 1990.

[24] Morton, Steven, G., “Single Instruction Multiple Data stream Cellular Array Processing Apparatus Employing Multiple State Logic for Coupling to Data Buses”, U.S. Pat. No. 4,916,657, April 1990.

[25] Luc Mary and Barazesh, Balman, “Processor for Signal processing and Hierarchical Multiprocessing Structure Including At Least One Such Processor”, U.S. Pat. No. 4,845,660, July 1989.

[26] Corinthios, Michael J., “General Base State Assignment for Massive Parallelism”, submitted for consideration toward publication, IEEE Trans. Comput., May, 1998, pp. 1-30.

I claim:
 1. A processor comprising general base processing elements and a partitioned memory, said processor being configured using a pilots matrix and a general base, denoted p, hypercube transformations where p is an arbitrary integer, for dispatching and sequencing M=p^(m) general base p processing elements, m being an integer and effecting contention-free memory partitioning for parallel processing of products of general base p factorizations and decompositions of N×N matrices where N=p^(n), n being an arbitrary integer.
 2. A processor comprising general base processing elements and a partitioned memory, said processor being configured using a pilots matrix and a general base, denoted p, hypercube transformations where p is an arbitrary integer, for dispatching and sequencing M=p^(m) general base p processing elements, m being an integer and effecting contention-free memory partitioning for parallel processing of products of general base factorizations of N×N matrices where N=p^(n), n being an arbitrary integer, applied to one of a Generalized-Walsh-Chrestenson Transform matrix and a Fourier Transform matrix.
 3. A processor as in claim 1, applied to image processing.
 4. A process as in claim 2, applied to image processing.